Analytical Solution of a Stochastic Content Based Network Model 






Muhittin Mungan, Alkan Kabakçıoğlu, Duygu Balcan, Ayşe Erzan

Department of Physics, Faculty of Arts and Sciences,
Boğaziçi University, 34342 Bebek Istanbul, Turkey

Dipartimento di Fisica, Università di Padova, I-35131 Padova, Italy

Department of Physics, Faculty of Sciences and Letters,
Istanbul Technical University, Maslak 34469, Istanbul, Turkey

Gürsey Institute, P.O.B. 6, Çengelköy, 34680 Istanbul, Turkey

(Dated: February 9, 2008) 

We define and completely solve a content-based directed network whose nodes consist of random 
words and an adjacency rule involving perfect or approximate matches, for an alphabet with an 
arbitrary number of letters. The analytic expression for the out-degree distribution shows a crossover 
from a leading power law behavior to a log-periodic regime bounded by a different power law decay. 
The leading exponents in the two regions have a weak dependence on the mean word length, and an 
even weaker dependence on the alphabet size. The in-degree distribution, on the other hand, is much 
narrower and does not show scaling behavior. The results might be of interest for understanding 
the emergence of genomic interaction networks, which rely, to a large extent, on mechanisms based 
on sequence matching, and exhibit similar global features to those found here. 

I. INTRODUCTION

In a previous paper, two of us (Balcan and Erzan) [1] introduced and numerically simulated a
content-based network [2], with a random binary string associated with each node. The network
arose by postulating that a directed edge exists from node i to node j if and only if the
string associated with the ith node, which can be regarded as a random word, occurs at least
once in the random word associated with the jth node.

This stochastic network was shown [1] to display a distinctly different topology from either
the classical random networks of Erdős and Rényi [3] or the "scale free" networks of the
preferential-attachment universality class introduced by Barabási and Albert [4, 5].
Simulations [1] revealed that the in- and out-degree distributions were markedly different,
with the in-degree distribution being rather localised. The out-degree distribution displayed
a sharp crossover behavior. For small out-degree $d$, the distribution $n(d)$ exhibited a
putative scaling behavior over a very narrow region, where the log-log plot could be fitted
with a straight line of slope $-\gamma_1 \simeq -1$, whereas, for larger $d$, log-periodic
oscillations were found, with an envelope that could again be fitted, on a double logarithmic
plot, by a straight line of slope $-\gamma_2 \simeq -1/2$.

The purpose of this paper is twofold. We first extend the model of Balcan and Erzan [1] to a
broader class of models in which the random strings are derived from an $(r + 1)$-letter
alphabet and where partial matches are allowed. Second, we obtain analytical expressions for
the ensemble averaged in- and out-degree distributions and investigate the crossover behavior
of the out-degree distribution. We show that the putative scaling behavior observed in the
simulations coincides with the leading power law behavior obtained from our analytical
results. We describe in detail the finite size corrections to the infinite network limit.
Comparison of our analytical predictions with the numerical data [1] for the $r = 2$ random
bit string model with perfect matches shows very good agreement.

The paper is organized as follows: In the next section we reformulate the random string model
of [1] for an alphabet of $r + 1$ letters. Our analytical results depend on the matching
probability $p(l, k)$ that a string of length $l$, selected randomly from the set of all
strings of length $l$, is contained at least once in a string of length $k$, $k \ge l$, that
has been selected randomly from the set of all strings of length $k$. In Section III we derive
an approximate form for this probability that is valid for moderately long strings,
$k \ll r^l$, and that allows for partial matches. Using the results of Section III, we obtain
in Section IV analytical expressions for the in- and out-degree distributions. We investigate
the scaling behavior of the out-degree distribution in these models and compare our results
with the numerical data of [1]. We conclude this paper with a discussion of the possible
relevance of our results to genomic networks, in Section V.



II. THE RANDOM STRING MODEL 

Consider a random sequence $C$ of fixed length $L$, consisting of letters from an alphabet
$\mathcal{A}$ of $r + 1$ letters. The elements $x \in \{0, 1, \ldots, r\}$ of the sequence $C$
are assumed to be independently and identically distributed according to

$$P(x) = p\,\delta(x - r) + (1 - p)\,\frac{1}{r} \sum_{m=0}^{r-1} \delta(x - m). \tag{1}$$

A subsequence $G_i$ of $C$, composed of the letters $\{0, \ldots, r - 1\}$ only and sandwiched
between the $i$th and $(i+1)$th occurrences of the letter "r," will be called the $i$th
"random word," or "string," and will be associated with the $i$th vertex of a graph. For
convenience, we assume that a letter "r" has also been placed at the 0th and $(L+1)$th
positions. With these definitions, the $i$th string can be written as

$$G_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,\ell_i}), \qquad i = 1, 2, \ldots, N, \tag{2}$$

where $N$ is the number of strings (equivalently, vertices), the "letter"
$x_{i,\lambda} \in \{0, \ldots, r - 1\}$, $\lambda = 1, \ldots, \ell_i$, and $\ell_i$ is the
length of the $i$th string $G_i$. Let $n_\ell$ be the number of strings of length $\ell$ and
$q = 1 - p$. It follows that

$$\sum_i \ell_i = L - N, \qquad \sum_\ell n_\ell = N, \tag{3}$$

$$\langle \ell \rangle = p^{-1} - 1, \qquad \langle n_\ell \rangle = L p^2 q^\ell, \qquad \langle N \rangle = L p. \tag{4}$$

Unless noted otherwise, we will assume that $L$ and $Lp$ are sufficiently large so that
fluctuations in the number and length of the strings for different realizations of the random
sequence $C$ can be neglected when calculating statistical properties of quantities of
interest. We will also discard the cases with $\ell = 0$ and construct the graph from the
remaining vertices. The adjacency matrix is defined by the matching condition

$$w_{ij} = \begin{cases} 1, & G_i \subset G_j, \\ 0, & \text{otherwise.} \end{cases} \tag{5}$$

By $G_i \subset G_j$ we mean that there exists an integer $\lambda$ such that
$0 \le \lambda \le \ell_j - \ell_i$ and

$$x_{i,l} = x_{j,\lambda + l}, \qquad l = 1, \ldots, \ell_i. \tag{6}$$

Two vertices are said to be connected if the string $G_i$ appears as a (contiguous)
subsequence of $G_j$, or in other words $G_j$ matches $G_i$. Thus $w_{ij} = 1$ indicates a
directed link (an edge) from $G_i$ to $G_j$. We will also consider imperfect matches, where
Eq. (6) is valid only for some values of $l$ rather than all values. In order to avoid
ambiguity we will refer to the former case as a perfect match. For
Lp large enough {p > pdL), see Q), which is assumed 
here, the graph consists of one giant cluster. We will 
henceforth refer to this graph as the network, and denote 
the vertices, or equivalently, the strings associated with 
them, as the "nodes." 
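
As an illustration of the construction just described, the following is a minimal sketch, not the code used in [1]. The function names `random_words`, `contains`, and `out_degrees` are hypothetical, and excluding self-matches (i = j) is an assumption made here, since the text does not state how self-loops are treated.

```python
import random

def random_words(L, p, r, rng):
    """Cut a random sequence of length L into words delimited by the letter "r"."""
    words, current = [], []
    for _ in range(L):
        if rng.random() < p:              # delimiter letter "r", drawn with probability p
            words.append(tuple(current))
            current = []
        else:                             # one of the r ordinary letters, drawn uniformly
            current.append(rng.randrange(r))
    words.append(tuple(current))          # closing delimiter at position L + 1
    return [w for w in words if len(w) > 0]   # discard zero-length words

def contains(short, long_):
    """True if `short` occurs as a contiguous block of `long_` (perfect match, Eq. (6))."""
    l, k = len(short), len(long_)
    return any(long_[a:a + l] == short for a in range(k - l + 1))

def out_degrees(words):
    """Out-degree of node i: number of nodes j != i whose word contains word i.
    Assumption: self-matches are not counted as edges."""
    return [sum(1 for j, wj in enumerate(words) if j != i and contains(wi, wj))
            for i, wi in enumerate(words)]

# Small demonstration run (much smaller than the L = 15000 used for the numerical data).
demo_words = random_words(2000, 0.05, 2, random.Random(1))
demo_degrees = out_degrees(demo_words)
print(len(demo_words), max(demo_degrees))
```

Short words have many chances to occur inside longer ones, so they tend to carry the largest out-degrees, which is the qualitative origin of the peak structure discussed below.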

The resulting network was numerically studied earlier by Balcan and Erzan [1] for the case of
binary strings, i.e., $r = 2$, and perfect matches, Eq. (6), where it was shown that the
out-degree distribution behaved linearly on a log-log plot over a very narrow initial range,
with a slope of $\simeq -1$. Beyond a crossover point the distribution exhibited an
oscillatory behavior, whose envelope again behaved linearly on a log-log plot, with a
different slope, namely $\simeq -1/2$. The out-degree distribution is shown in Figure 1, where
the numerical results were obtained [1] by averaging the out-degree distributions over 500
graphs, associated with independently generated sequences of length $L = 15000$, with
$p = 0.05$. Notice the strong oscillatory behavior. It turns out that each peak in the
out-degree distribution is supported predominantly by the out-degrees of genes with a
corresponding common length $l$.

FIG. 1: Scaling behavior of the out-degree distribution. The numerical data (circles) show a
cross-over in the scaling behavior from small values of the out-degree to larger values. The
solid line is the theoretical expression. The dashed lines serve as a guide to the eye for the
predicted scaling behavior and have been offset for clarity. The cross-over occurs at
$d_c \approx 6.6$ and is shown as a vertical line.

In order to proceed with the analytical treatment, it is convenient to group the $G_i$ into
subsets according to their lengths, and we define

$$\mathcal{G}_l \equiv \{ G_i \mid \ell_i = l \}. \tag{7}$$

It turns out that the central quantity determining the behavior of the in- and out-degree
distributions is the probability $p(l, k)$ that a string in $\mathcal{G}_l$ has an outgoing
edge terminating in a member of $\mathcal{G}_k$. We therefore turn next to the derivation of
$p(l, k)$. The discussion of the degree distributions will then be taken up in Section IV.



III. ANALYTICAL RESULTS FOR THE 
MATCHING PROBABILITY 

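Let $x$ and $y$ be variables such that $x, y \in \{0, \ldots, r - 1\}$. Define an interaction
$u(x, y)$ between $x$ and $y$ as

$$u(x, y) = 1 - \delta(x - y). \tag{8}$$

Let $\mathbf{x} = (x_1, x_2, x_3, \ldots, x_l)$ and $\mathbf{y} = (y_1, y_2, y_3, \ldots, y_l)$
be two strings of $l$ letters and define their interaction $U(\mathbf{x}, \mathbf{y})$ as

$$U(\mathbf{x}, \mathbf{y}) = \sum_{t=1}^{l} u(x_t, y_t). \tag{9}$$

The function $U(\mathbf{x}, \mathbf{y})$, as defined above, counts the number of unmatched
letters between the strings $\mathbf{x}$ and $\mathbf{y}$.

Introduce an "inverse temperature" $\beta$ and consider the Boltzmann factor
$e^{-\beta U}$. In the "zero temperature" limit we have

$$\lim_{\beta \to \infty} e^{-\beta U(\mathbf{x}, \mathbf{y})} = \begin{cases} 1, & \mathbf{x} = \mathbf{y}, \\ 0, & \text{otherwise.} \end{cases} \tag{10}$$

We see that the limit $\beta \to \infty$ is a "no tolerance" limit [6], enforcing perfect
matching of $\mathbf{x}$ and $\mathbf{y}$, i.e. $x_t = y_t$, $t = 1, 2, \ldots, l$. Let
$\mathbf{y} = (y_1, y_2, \ldots, y_k)$ be a string of length $k \ge l$ and denote by
$\mathbf{y}_{a,l} = (y_{a+1}, y_{a+2}, \ldots, y_{a+l})$ the substring of length $l$ starting
at position $a$, $a = 0, 1, \ldots, k - l$. Furthermore let

$$f_a(\mathbf{x}, \mathbf{y}; \beta) = e^{-\beta U(\mathbf{x}, \mathbf{y}_{a,l})}, \tag{11}$$

so that we have

$$f_a(\mathbf{x}, \mathbf{y}) \equiv \lim_{\beta \to \infty} f_a(\mathbf{x}, \mathbf{y}; \beta) = \begin{cases} 1, & \mathbf{x} = \mathbf{y}_{a,l}, \\ 0, & \text{otherwise.} \end{cases} \tag{12}$$

Thus, $f_a(\mathbf{x}, \mathbf{y}) = 1$ if and only if $\mathbf{x}$ matches $\mathbf{y}$ at
position $a$, and zero otherwise.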

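Likewise, let $f(\mathbf{x}, \mathbf{y})$ be a function that takes on the value one if the
$k$-string $\mathbf{y}$ contains the given $l$-string $\mathbf{x}$ and zero otherwise. Note
that the complement of the event that $\mathbf{x}$ matches $\mathbf{y}$ is the event that
$\mathbf{x}$ does not match $\mathbf{y}$ anywhere. Thus, using Eq. (12), we can write

$$f(\mathbf{x}, \mathbf{y}) = 1 - \prod_{a=0}^{k-l} \left[ 1 - f_a(\mathbf{x}, \mathbf{y}) \right]. \tag{13}$$

Letting $p(l, k; \mathbf{x})$ denote the probability that a randomly drawn $k$-string
$\mathbf{y}$ contains a given $l$-string $\mathbf{x}$, we therefore find

$$p(l, k; \mathbf{x}) = 1 - \frac{1}{r^k} \sum_{\mathbf{y}} \prod_{a=0}^{k-l} \left[ 1 - f_a(\mathbf{x}, \mathbf{y}) \right], \tag{14}$$

where $r^k$ is the number of distinct $k$-strings of $r$ letters, and $\sum_{\mathbf{y}}$
denotes the sum over all such strings $\mathbf{y}$.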

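Generalizing the above equation to incorporate partial matches we obtain

$$p(l, k; \mathbf{x}) = \lim_{\beta \to \infty} p(l, k; \mathbf{x}, \beta), \tag{15}$$

where

$$p(l, k; \mathbf{x}, \beta) = 1 - \frac{1}{r^k} \sum_{\mathbf{y}} \prod_{a=0}^{k-l} \left[ 1 - f_a(\mathbf{x}, \mathbf{y}; \beta) \right]. \tag{16}$$

The products in Eq. (16) can be expanded and we obtain a Mayer-like sum

$$p(l, k; \mathbf{x}, \beta) = \frac{1}{r^k} \sum_{\mathbf{y}} \sum_{a} f_a - \frac{1}{r^k} \sum_{\mathbf{y}} \sum_{a<b} f_a f_b + \frac{1}{r^k} \sum_{\mathbf{y}} \sum_{a<b<c} f_a f_b f_c - \cdots, \tag{17}$$

which we can write as

$$p(l, k; \mathbf{x}, \beta) = \sum_{a} W^{(1)}(a; \mathbf{x}) - \sum_{a<b} W^{(2)}(a, b; \mathbf{x}) + \sum_{a<b<c} W^{(3)}(a, b, c; \mathbf{x}) - \cdots, \tag{18}$$

where

$$\begin{aligned}
W^{(1)}(a; \mathbf{x}) &= \frac{1}{r^k} \sum_{\mathbf{y}} f_a(\mathbf{x}, \mathbf{y}; \beta), \\
W^{(2)}(a, b; \mathbf{x}) &= \frac{1}{r^k} \sum_{\mathbf{y}} f_a(\mathbf{x}, \mathbf{y}; \beta)\, f_b(\mathbf{x}, \mathbf{y}; \beta), \\
W^{(3)}(a, b, c; \mathbf{x}) &= \frac{1}{r^k} \sum_{\mathbf{y}} f_a(\mathbf{x}, \mathbf{y}; \beta)\, f_b(\mathbf{x}, \mathbf{y}; \beta)\, f_c(\mathbf{x}, \mathbf{y}; \beta).
\end{aligned} \tag{19}$$

Using Eqs. (9) and (11), we obtain

$$W^{(1)}(a; \mathbf{x}) = \frac{1}{r^l} \left[ 1 + (r - 1) e^{-\beta} \right]^{l} \equiv W^{(1)}. \tag{20}$$

Note that $W^{(1)}(a; \mathbf{x})$ is independent of $a$ and $\mathbf{x}$.

Let us now turn to the second order term, $W^{(2)}(a, b; \mathbf{x})$, in Eqs. (18) and (19).
Here we need to distinguish two cases, (i) $|b - a| \ge l$ and (ii) $|b - a| < l$.

In case (i), the sets of indices of $\mathbf{y}_{a,l}$ and $\mathbf{y}_{b,l}$ are distinct and
the evaluation of the partition sum proceeds analogously to Eq. (20), yielding

$$W^{(2)}(a, b; \mathbf{x}) = \frac{1}{r^{2l}} \left[ 1 + (r - 1) e^{-\beta} \right]^{2l}, \qquad |b - a| \ge l. \tag{21}$$

In case (ii), $|b - a| < l$, there is an overlap between the indices of $\mathbf{y}_{a,l}$ and
$\mathbf{y}_{b,l}$. Letting $|b - a| = m$, we find

$$W^{(2)}(a, b; \mathbf{x}) = \frac{1}{r^{l+m}} \left[ 1 + (r - 1) e^{-\beta} \right]^{2m} \prod_{t=1}^{l-m} \left[ 1 + (r - 1) e^{-2\beta} - u(x_t, x_{m+t}) \left( 1 - e^{-\beta} \right)^2 \right], \qquad |b - a| < l. \tag{22}$$

Note that $W^{(2)}(a, b; \mathbf{x})$, as defined in Eqs. (21) and (22), depends on
$\mathbf{x}$ only when $|b - a| < l$. Next, we perform the $\mathbf{x}$ average of
$W^{(2)}(a, b; \mathbf{x})$,

$$\frac{1}{r^l} \sum_{\mathbf{x}} W^{(2)}(a, b; \mathbf{x}) = \frac{1}{r^{2l}} \left[ 1 + (r - 1) e^{-\beta} \right]^{2l}. \tag{23}$$

The calculations leading to Eqs. (22) and (23) are a little involved and can be found in the
appendix.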

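Comparing Eqs. (20) and (23), we see that once averaged over $\mathbf{x}$, $W^{(2)}$
factorizes as

$$W^{(2)} \equiv \left\langle W^{(2)}(a, b; \mathbf{x}) \right\rangle_{\mathbf{x}} = \left( W^{(1)} \right)^2, \tag{24}$$

or equivalently,

$$\langle f_a f_b \rangle_{\mathbf{y},\mathbf{x}} = \langle f_a \rangle_{\mathbf{y},\mathbf{x}} \, \langle f_b \rangle_{\mathbf{y},\mathbf{x}}, \qquad a \ne b, \tag{25}$$

where, for simplicity, we have introduced the shorthand notation
$\langle \cdots \rangle_{\mathbf{y},\mathbf{x}}$ to denote averaging first over $\mathbf{y}$
and then over $\mathbf{x}$. Let us therefore make the approximation that all higher moments
factorize similarly,

$$\langle f_{a_1} f_{a_2} \cdots f_{a_s} \rangle_{\mathbf{y},\mathbf{x}} = \langle f_{a_1} \rangle_{\mathbf{y},\mathbf{x}} \, \langle f_{a_2} \rangle_{\mathbf{y},\mathbf{x}} \cdots \langle f_{a_s} \rangle_{\mathbf{y},\mathbf{x}}, \tag{26}$$

with the $\{a_s\}$ being distinct. It can be readily shown that Eq. (26) is exact when
$a_{i+1} - a_i \ge l$, i.e., when there are no overlaps between the segments at the positions
$a_i$. Upon substituting Eq. (26) into Eq. (17) and performing the $\mathbf{x}$ average we
obtain the matching probability

$$p(l, k; \beta) = \left\langle p(l, k; \mathbf{x}, \beta) \right\rangle_{\mathbf{x}}, \tag{27}$$

with

$$p(l, k; \beta) = 1 - \left\{ 1 - \frac{1}{r^l} \left[ 1 + (r - 1) e^{-\beta} \right]^{l} \right\}^{k-l+1}. \tag{28}$$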



In the "zero temperature" limit ($\beta \to \infty$), this becomes

$$p(l, k) = 1 - \left( 1 - \frac{1}{r^l} \right)^{k-l+1}. \tag{29}$$

For $r^l \gg k$, $p(l, k; \beta)$ has the asymptotic form

$$p(l, k; \beta) = 1 - \exp\left( -\frac{k - l + 1}{r^l} \left[ 1 + (r - 1) e^{-\beta} \right]^{l} \right), \tag{30}$$

which for $\beta \to \infty$ becomes

$$p(l, k) = 1 - \exp\left( -\frac{k - l + 1}{r^l} \right). \tag{31}$$

For very large $l$ this further reduces to

$$p(l, k) = \frac{k - l + 1}{r^l}. \tag{32}$$

Note that a finite $\beta$ acts like an enhanced matching probability, i.e., it allows false
positive matches. In the limit $\beta \to 0$, the matching probability becomes

$$\lim_{\beta \to 0} p(l, k; \beta) = 1. \tag{33}$$

Hence the "high-temperature" limit of our model corresponds to indiscriminate matches.
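
The closed form of Eq. (28) and its two limits are easy to check numerically. The sketch below is only illustrative; the function name `p_match` and the convention of passing `math.inf` for the zero-temperature limit are assumptions made here, not notation from the text.

```python
import math

def p_match(l, k, r, beta=math.inf):
    """Approximate matching probability of Eq. (28) for an l-word inside a k-word."""
    if k < l:
        return 0.0                                   # a longer word cannot be contained
    boltz = 0.0 if math.isinf(beta) else math.exp(-beta)
    w1 = ((1.0 + (r - 1) * boltz) / r) ** l          # single-position weight W^(1), Eq. (20)
    return 1.0 - (1.0 - w1) ** (k - l + 1)           # k - l + 1 candidate positions

print(p_match(3, 10, 2))             # perfect matches, Eq. (29)
print(p_match(3, 10, 2, beta=1.0))   # partial matches tolerated: larger probability
print(p_match(3, 10, 2, beta=0.0))   # "high-temperature" limit, Eq. (33)
```

Lowering `beta` can only increase the returned value, in line with the remark that a finite inverse temperature enhances the matching probability.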

FIG. 2: Comparison of the exact matching probability $p(l, k)$ (circles) with the approximate
expression, Eq. (29) (lines), for $r = 2$ and perfect matches. The curves are (from top to
bottom) for values of $k = 16, 14, 12, 10, 8, 6, 4, 2$.



Of course, the crucial approximation, Eq. (26), is not correct in general and one expects
corrections coming from higher order correlations contained in Eq. (18). These correlations
are due to the fact that if a given string $\mathbf{x}$ is matched at a position $a$, this
affects the likelihood of matching the same string at any nearby location $b$ with
$|b - a| < l$. Nevertheless, the approximate result for $p(l, k)$, Eq. (29), is surprisingly
good. Fig. 2 shows a comparison of the matching probability obtained from exact enumeration
carried out computationally, with the analytical expression (29) for $r = 2$ and perfect
matches. As can be seen from the figure, there are only very small discrepancies for small $l$
when $k > 2^l$, e.g. the data points around $k = 16, 14, 12$ with $l = 4, 3, 2$. Since our
expression for $p(l, k)$, Eq. (29), is exact for $l = 1$, there are no discrepancies at
$l = 1$.
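
The kind of exact enumeration referred to above can be sketched for very short strings as follows. This is a brute-force illustration under our own naming (`p_exact`, `p_approx`), not the enumeration code behind Fig. 2, and it is only practical for small $l$ and $k$ because it visits all $r^l r^k$ pairs of words.

```python
from itertools import product

def p_exact(l, k, r=2):
    """Exact matching probability: enumerate every l-word x and every k-word y."""
    hits = 0
    for x in product(range(r), repeat=l):
        for y in product(range(r), repeat=k):
            if any(y[a:a + l] == x for a in range(k - l + 1)):
                hits += 1
    return hits / (r ** l * r ** k)

def p_approx(l, k, r=2):
    """Eq. (29): non-matches at the k - l + 1 positions treated as independent."""
    return 1.0 - (1.0 - r ** (-l)) ** (k - l + 1)

# Compare the two for a few short words inside an 8-letter word.
for l in (1, 2, 3):
    print(l, p_exact(l, 8), p_approx(l, 8))
```

For $l = 1$ the two columns coincide, as stated in the text; for $l > 1$ any difference comes from correlated matches at overlapping positions.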

Notice that Eq. (32) is the matching probability that can alternatively be obtained by
assuming that the probabilities of matching a string of length $l$ at any position in a string
of length $k$ are independent and equal to $1/r^l$. Eq. (29), on the other hand, is the
matching probability that can also be found by assuming that the probabilities of not matching
a string of length $l$ at any position in a string of length $k$ are independent and equal to
$1 - 1/r^l$. Thus the factorization approximation, Eq. (26), leading to Eq. (28), implies that
the probabilities of not matching at a given position are independent.

For the regime of interest, $k \ll 2^l$, the approximation leading to Eq. (29) is extremely
good. We think that this is due to the fact that the factorization property underlying our
approximation, Eq. (26), is exact for the two-point correlation function ($s = 2$), Eq. (24).
This means that any corrections to this result must come from higher order correlations with
strongly overlapping segments, since non-overlapping segments will factorize and thus reduce
to lower order correlators. This is very similar to the connected cluster expansion in
statistical mechanics [7]. Indeed, such an expansion can be set up; however, the calculations
are rather tedious due to the discreteness of the problem and are beyond the scope of this
paper. Yet it is clear that the weight of an $s$-point correlation function with $s$
overlapping (connected) segments must be very small for large $s$, since the overlap imposes
very strong conditions on the structure of the string $\mathbf{x}$ to be matched.
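
The statement that the pair correlation factorizes exactly after averaging, while strongly overlapping higher-order terms need not, can be checked by brute force for perfect matches. The sketch below uses a hypothetical helper `joint_average`; it enumerates all words and is meant only for very small $l$ and $k$.

```python
from itertools import product

def joint_average(l, k, positions, r=2):
    """Average over all l-words x and k-words y of the product of f_a(x, y)
    over the given start positions a, in the perfect-match limit."""
    total = 0
    for x in product(range(r), repeat=l):
        for y in product(range(r), repeat=k):
            if all(y[a:a + l] == x for a in positions):
                total += 1
    return total / (r ** l * r ** k)

single = joint_average(3, 7, [0])        # equals r**(-l)
pair = joint_average(3, 7, [0, 1])       # overlapping pair, shift m = 1
triple = joint_average(2, 5, [0, 1, 2])  # three mutually overlapping segments

print(single, pair, single ** 2)   # the pair average equals the product of singles
print(triple, 2.0 ** (-6))         # the triple average differs from the factorized value
```

In this small example the overlapping triple is twice as likely as the factorized estimate, yet its absolute weight is small, which is consistent with the argument given above.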

For the remainder of the paper it is convenient to define the quantities $z$ and $t$ as

$$z \equiv \frac{1}{r} \left[ 1 + (r - 1) e^{-\beta} \right], \tag{34}$$

$$t \equiv 1 - z^l, \tag{35}$$

where we have suppressed the $\beta$, $r$ and $l$ dependence for clarity. Notice that the
effect of the number of letters in the alphabet, $r$, and the extent of mismatch, as
parametrized by the "inverse temperature" $\beta$, enter into the expression for
$p(l, k; \beta)$ as a single parameter, $z$, as defined above. With the above definitions,
Eq. (28) becomes

$$p(l, k; z) = 1 - t^{k-l+1} = 1 - \left( 1 - z^l \right)^{k-l+1}. \tag{36}$$

The "zero-temperature" limit is given by $z = r^{-1}$, while the "high-temperature" limit is
$z = 1$. The range of $z$ is therefore $z \in (r^{-1}, 1)$, which for $r \gg 1$ approaches
$z \in (0, 1)$.

We note in passing that the matching probability computed in this section is in a sense
complementary to the problem of sequence alignment [8, 9], which has important applications in
the study of proteins and DNA. The problem there is to identify subsequences of arbitrary
length, showing strong similarity beyond pure statistical chance, within two long sequences
sampling the same alphabet, possibly with different native probabilities. The pioneering work
of Altschul, Karlin, et al. [8, 9] yields a probability distribution for the similarity score
of such likely regions, under the assumption that the region with the highest score is unique
(i.e., non-degenerate), and that the two sequences searched are of comparable length and
sufficiently long. The scoring scheme is to a large extent arbitrary as long as the scores
corresponding to some degree of matching are rare (and positive) while those corresponding to
mismatches are much more probable (and negative). This arbitrariness may be removed by proper
normalization, and scores obtained via different schemes can be compared in a meaningful way.
The matching probability computed in the present paper could be related to the probability for
the highest score (corresponding to an exact match without gaps), holding for the entire
length of the shorter sequence. However, our calculation makes no assumptions regarding the
relative lengths of the two sequences, apart from the obvious requirement that $l \le k$. The
approximation to which we have to resort in the final solution works best when either the two
sequences are almost of the same length, or if $k \ll r^l$. Moreover, there is no assumption
regarding the number of times the highest score is achieved. More interestingly, the
statistics of multiple high-scoring segments [9] could have been related to the out-degree
statistics of a given node had we taken each high scoring match in the complete random
sequence to correspond to a different edge. As it is, a single edge corresponds to the
presence of one or more occurrences of a shorter string, say $G_i$, inside a longer string
$G_j$. That is, multiple occurrences of the shorter string within a subsequence of the
complete random sequence are bunched together to result in a single edge between the nodes i
and j.

We now turn to the calculation of the in- and out-degree distributions.



IV. THE DEGREE DISTRIBUTIONS 

In Section II we showed that the subsequences $\{G_i\}$ of a random sequence $C$ generate a
network whose nodes are associated with these strings, and whose edges are defined by the
matching relation, Eq. (5). In this section we will derive the in- and out-degree
distributions associated with this network.

Consider a randomly selected string $G_i$. The in- and out-degree of the corresponding node,
$d_{\rm in}(i)$ and $d_{\rm out}(i)$, are defined as the total number of edges terminating in
and originating from that node, respectively,

$$d_{\rm in}(i) = \sum_j w_{ji}, \qquad d_{\rm out}(i) = \sum_j w_{ij}. \tag{37}$$

The corresponding in- and out-degree distributions are given by

$$n_{\rm in}(d) = \sum_i \delta\big(d - d_{\rm in}(i)\big), \qquad n_{\rm out}(d) = \sum_i \delta\big(d - d_{\rm out}(i)\big). \tag{38}$$

A. The Out-Degree Distribution

Letting $\mathcal{G}_l$ denote the set of strings of length $l$, we can rewrite the out-degree
distribution, Eq. (38), as

$$n_{\rm out}(d) = \sum_{l=1}^{L} n_l \left[ \frac{1}{n_l} \sum_{j \in \mathcal{G}_l} \delta\big(d - d_{\rm out}(j)\big) \right]. \tag{39}$$

For large $n_l$, the quantity in brackets will approach the (conditional) probability
$P_{\rm out}(X_l = d \mid l)$ that a randomly selected string whose length is given to be $l$
has an out-degree $d$.



In the limit $L, N \to \infty$, taken such that the ratio $N/L = p$ of the number of strings,
$N$, to the length of the whole random sequence, $L$, remains constant, all the possible $r^l$
realizations of random words of a given length $l$ will be present with equal respective
weights and we have

$$\lim_{L, N \to \infty} \frac{1}{n_l n_k} \sum_{i \in \mathcal{G}_l} \sum_{j \in \mathcal{G}_k} w_{ij} = p(l, k). \tag{40}$$

We will refer to this limit as the large-L limit.

The quantity $p(l, k)$, as defined in the above equation, is the probability that a randomly
selected string of given length $l$ matches another independently and randomly selected string
of length $k$. This probability has been calculated in Section III for the general case of
imperfect matches, Eq. (28), as well as perfect matches, Eq. (29). Eqs. (39) and (40) show the
self-averaging property of the degree distribution in the large-L limit.

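Define the random variable $X_{lk}$ as the number of edges originating from a randomly
selected string of length $l$ that terminate in strings of length $k$. Then $X_l$ can be
written as a sum of the random variables $X_{lk}$,

$$X_l = \sum_{k \ge l} X_{lk}. \tag{41}$$

We can therefore write $\langle X_l \rangle$ as

$$\langle X_l \rangle = \frac{1}{n_l} \sum_{i \in \mathcal{G}_l} \sum_{k \ge l} \sum_{j \in \mathcal{G}_k} w_{ij}, \tag{42}$$

or,

$$\langle X_l \rangle = \sum_{k \ge l} n_k \left[ \frac{1}{n_l n_k} \sum_{i \in \mathcal{G}_l} \sum_{j \in \mathcal{G}_k} w_{ij} \right]. \tag{43}$$

We see from Eqs. (43) and (40) that in the large-L limit

$$\langle X_{lk} \rangle = n_k \, p(l, k), \tag{44}$$

and

$$\langle X_l \rangle = \sum_{k \ge l} n_k \, p(l, k), \tag{45}$$

where $\langle \cdots \rangle$ denotes an average over all the strings of length $l$ in the
complete random sequence. Note that in the large-L limit $X_{lk}$ is binomially distributed,

$$P(X_{lk} = d \mid l) = \binom{n_k}{d} \, p(l, k)^d \, \big(1 - p(l, k)\big)^{n_k - d}. \tag{46}$$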



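As can be seen from Eq. (41), $X_l$ is a sum of the random variables $X_{lk}$, and thus in the
large-L limit the central limit theorem assures that the distribution of $X_l$ will approach a
Gaussian distribution,

$$P_{\rm out}(X_l = d \mid l) = \frac{1}{\sqrt{2\pi \sigma_l^2}} \exp\left[ -\frac{(d - d_l)^2}{2 \sigma_l^2} \right], \tag{47}$$

whose mean $d_l$ and standard deviation $\sigma_l$ are given by those of $X_{lk}$, Eq. (41),
according to

$$d_l = \langle X_l \rangle = \sum_{k \ge l} \langle X_{lk} \rangle, \tag{48}$$

$$\sigma_l^2 = \langle X_l^2 \rangle - \langle X_l \rangle^2 = \sum_{k \ge l} \sigma_{lk}^2, \tag{49}$$

where

$$\sigma_{lk}^2 = \langle X_{lk}^2 \rangle - \langle X_{lk} \rangle^2. \tag{50}$$

For binomially distributed $X_{lk}$ we have

$$\langle X_{lk} \rangle = n_k \, p(l, k), \tag{51}$$

$$\sigma_{lk}^2 = n_k \, p(l, k) \big(1 - p(l, k)\big). \tag{52}$$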



Using Eq. (36), one can readily carry out the sums in Eqs. (48) and (49) to find

$$d_l = N \, \frac{(qz)^l}{1 - qt}, \tag{53}$$

$$\sigma_l^2 = d_l \, \frac{pt}{1 - qt^2}. \tag{54}$$



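Noting also that the probability of selecting a string of length $l$ is $p q^l$, the total
out-degree distribution is given by

$$P_{\rm out}(d) = \sum_{l=1}^{L} p q^l \, P_{\rm out}(X_l = d \mid l), \tag{55}$$

and thus in the large-L limit we obtain

$$P_{\rm out}(d) = \sum_{l=1}^{\infty} p q^l \, \frac{1}{\sqrt{2\pi \sigma_l^2}} \exp\left[ -\frac{(d - d_l)^2}{2 \sigma_l^2} \right], \tag{56}$$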



with $d_l$ and $\sigma_l$ given by Eqs. (53) and (54), respectively. As $l$ becomes large,
$p(l, k)$ decreases towards zero. Thus with increasing $l$ the binomial distribution of
$X_{lk}$, Eq. (46), will approach a Poisson distribution of the same mean. Note that the sum
of independent Poisson distributed random variables is also Poisson distributed, with mean
equal to the sum of the individual means. Thus for large $l$, $X_l$ as defined in Eq. (41) is
Poisson. For a Poisson distributed random variable the variance equals its mean, so that for
large $l$ we expect

$$\sigma_l^2 = d_l, \tag{57}$$

as can also be directly verified by taking the appropriate limit in Eq. (54).
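
The closed forms for the mean and variance can be cross-checked against the defining sums. The sketch below assumes $n_k = N p q^k$ and $p(l, k; z) = 1 - (1 - z^l)^{k-l+1}$, uses the closed forms $d_l = N (qz)^l / (1 - qt)$ and $\sigma_l^2 = d_l \, pt / (1 - qt^2)$, and truncates the sum over $k$ at a large cutoff; the function names and the cutoff `kmax` are our own choices.

```python
def p_match_z(l, k, z):
    """Matching probability in terms of the single parameter z."""
    return 1.0 - (1.0 - z ** l) ** (k - l + 1)

def moments_direct(l, N, p, z, kmax=5000):
    """Mean and variance of X_l by summing the binomial moments over k >= l."""
    q = 1.0 - p
    mean = var = 0.0
    for k in range(l, kmax + 1):
        n_k = N * p * q ** k                 # expected number of words of length k
        pk = p_match_z(l, k, z)
        mean += n_k * pk
        var += n_k * pk * (1.0 - pk)
    return mean, var

def moments_closed(l, N, p, z):
    """Closed forms for the same two quantities."""
    q = 1.0 - p
    t = 1.0 - z ** l
    d_l = N * (q * z) ** l / (1.0 - q * t)
    return d_l, d_l * p * t / (1.0 - q * t * t)

for l in (1, 5, 20):
    print(l, moments_direct(l, 750, 0.05, 0.5), moments_closed(l, 750, 0.05, 0.5))
```

With the parameters of the numerical data ($N = 750$, $p = 0.05$, $z = 1/2$), the variance is far below the mean for short words and approaches the mean for long ones, which is the Poisson limit noted above.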



1. Ensemble Averages and Finite Size Effects 

The numerical data of [1] has been obtained by averaging over 500 realizations of a random
sequence of length $L = 15000$ with $N = 750$. A finite sample size will cause
sample-to-sample fluctuations in the number of strings, or "random words." An average over a
large ensemble of different realizations will yield the same average values for the
out-degrees as those obtained from a single random sequence of infinite length. However,
averaging over many realizations will increase the fluctuations around the mean. It is not
hard to see that this will affect predominantly nodes with large out-degrees (short strings),
where there is already self-averaging within the random sequence, but with a distribution
which varies from sample to sample.

Nodes with small out-degrees (long strings) correspond to rare matches and thus for these
nodes there is no self-averaging within the sample. To see this, consider the extreme case
where a sample contains on average one or fewer matches for such a node. When an ensemble
average is taken, the dominant contribution to the variance of the out-degree will come from
the sample-to-sample fluctuations.

Denoting the mean and variance of the out-degree of a node of length $l$, corrected for the
finite size, by $\tilde{d}_l$ and $\tilde{\sigma}_l^2$, respectively, we have

$$\tilde{d}_l = d_l, \qquad \tilde{\sigma}_l^2 \simeq \sigma_l^2 \quad \text{for large } l. \tag{58}$$

In what follows we will re-calculate previously introduced statistics, taking into account the
fluctuations in $n_k$. In order to avoid confusion, these quantities will be denoted with a
tilde.

We can estimate $\tilde{\sigma}_l^2$ as follows. The random variable $\tilde{X}_{lk}$ itself
is a sum of random variables:

$$\tilde{X}_{lk} = \sum_{j=1}^{\tilde{n}_k} Y_{ij}, \tag{59}$$

where $Y_{ij} = 1$ if the string $G_j$ of length $k$ matches the (given) string of length $l$
and zero otherwise. Such an event constitutes a Bernoulli trial and its probability is
$p(l, k)$. The mean and variance of $Y_{ij}$ are given by

$$\langle Y_{ij} \rangle = p(l, k), \tag{60}$$

$$\langle Y_{ij}^2 \rangle - \langle Y_{ij} \rangle^2 = p(l, k) \big(1 - p(l, k)\big). \tag{61}$$

The number of such trials is $\tilde{n}_k$, the number of elements of $\mathcal{G}_k$, and
hence $\tilde{n}_k$ itself is a random variable. For sufficiently large $N$ and for values of
$\tilde{n}_k$ near the mean, the constraints, Eq. (3), can be neglected and the probability of
finding $\tilde{n}_k$ strings of length $k$ is approximately binomially distributed,

$$P(\tilde{n}_k = n) = \binom{N}{n} \left[ p q^k \right]^{n} \left[ 1 - p q^k \right]^{N - n}. \tag{62}$$

We thus find

$$\langle \tilde{n}_k \rangle = N p q^k, \tag{63}$$

$$\langle \tilde{n}_k^2 \rangle - \langle \tilde{n}_k \rangle^2 = N p q^k \left( 1 - p q^k \right). \tag{64}$$

Finding the distribution of a sum over a finite random number $n$ of independently distributed
random variables $Y$ can be readily worked out using moment generating functions (see for
example Feller [10]). In the case when both $\tilde{n}_k$ and $Y_{ij}$ are binomial it turns
out that the resulting distribution is binomial again, and we find

$$P(\tilde{X}_{lk} = d \mid l) = \binom{N}{d} \left[ p q^k p(l, k) \right]^{d} \left[ 1 - p q^k p(l, k) \right]^{N - d}, \tag{65}$$

with mean and variance

$$\langle \tilde{X}_{lk} \rangle = N p q^k \, p(l, k), \tag{66}$$

$$\tilde{\sigma}_{lk}^2 = N p q^k \, p(l, k) \left[ 1 - p q^k \, p(l, k) \right]. \tag{67}$$

Thus Eq. (65) is the finite size result replacing Eq. (46), which is valid in the large-L
limit. As remarked before, the means of the two distributions in Eqs. (48) and (66) are equal,
i.e., $\langle \tilde{X}_{lk} \rangle = \langle X_{lk} \rangle$. However, the variances are
different and $\sigma_{lk}^2 < \tilde{\sigma}_{lk}^2$. Note that the second term in the
brackets of Eq. (67) is of the order of $p \ll 1$ for small $p$. Thus we find to order $p$

$$\tilde{\sigma}_{lk}^2 = \langle \tilde{X}_{lk} \rangle, \tag{68}$$

and consequently to this order the mean and variance of $\tilde{X}_l$ become

$$\tilde{\sigma}_l^2 = \langle \tilde{X}_l \rangle = \langle X_l \rangle = d_l, \tag{69}$$

where $d_l$ is the same mean out-degree that was previously obtained in the large-L limit,
Eq. (53). The out-degree distribution corrected for finite-size effects thus becomes, cf.
Eq. (56),

$$\tilde{P}_{\rm out}(d) = \sum_{l=1}^{\infty} p q^l \, \frac{1}{\sqrt{2\pi d_l}} \exp\left[ -\frac{(d - d_l)^2}{2 d_l} \right]. \tag{70}$$



Comparing this expression with the distribution obtained in the large-L limit, Eq. (56), we
find that finite size corrections are only present for small $l$, since we have already shown
that the relation $d_l = \sigma_l^2$ is also valid (viz. Eq. (57)) in the large $l$ region for
the large-L case. Figure 3 shows a comparison of the numerically obtained out-degree
distribution (circles) with the theoretical expressions with and without finite size
corrections. The solid line is the analytical result for the out-degree distribution,
Eq. (70), that takes into account finite size corrections, while the dotted line corresponds
to the case where the network is assumed to be self-averaging, i.e., Eq. (56) is satisfied,
and thus sample-to-sample fluctuations can be neglected. Note the large difference from the
observed behavior for $l < 6$ ($d > 200$) in the height and broadness of the distributions
when finite size effects are not taken into account. The agreement of the finite size
corrected distribution with the numerical data, on the other hand, is rather good, and we
conclude that finite size effects present in the numerical data for short nodes are
satisfactorily accounted for.
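
To make the comparison concrete, the two theoretical curves can be evaluated as superpositions of Gaussian peaks. The sketch below is an illustration only: it assumes the closed forms $d_l = N (qz)^l / (1 - qt)$ and $\sigma_l^2 = d_l \, pt / (1 - qt^2)$ with $t = 1 - z^l$, truncates the sum over $l$ at a hypothetical cutoff `lmax`, and the function name `out_degree_pdf` is our own.

```python
import math

def out_degree_pdf(d, N, p, z, finite_size=True, lmax=200):
    """Superposition of Gaussian peaks, one per word length l.
    finite_size=True uses variance d_l (finite-size corrected form);
    finite_size=False uses the large-L variance sigma_l^2."""
    q = 1.0 - p
    total = 0.0
    for l in range(1, lmax + 1):
        t = 1.0 - z ** l
        d_l = N * (q * z) ** l / (1.0 - q * t)
        var = d_l if finite_size else d_l * p * t / (1.0 - q * t * t)
        if var <= 0.0:
            continue                       # degenerate peak (e.g. z = 1); skipped here
        weight = p * q ** l                # probability that a word has length l
        total += weight * math.exp(-(d - d_l) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
    return total

# Peak position for l = 1 with the parameters of the numerical data.
d1 = 750 * (0.95 * 0.5) / (1.0 - 0.95 * 0.5)
print(d1, out_degree_pdf(d1, 750, 0.05, 0.5, True), out_degree_pdf(d1, 750, 0.05, 0.5, False))
```

At the $l = 1$ peak the large-L curve is several times higher and correspondingly narrower than the finite-size corrected one, which mirrors the difference in height and broadness described above.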




FIG. 3: Comparison of the theoretical out-degree distributions with numerical data (circles).
The dotted line shows the theoretical result for a network in the large-L limit, where the
network is self-averaging and thus all possible realizations of a string of a given length $l$
can be found. The number to the right of each peak refers to the node length $l$ that
contributes predominantly to that peak. The solid line is obtained after correcting for finite
size effects (see text for details). In both cases, the locations of the peaks are accurately
predicted. Note that the results for the large-L limit differ strongly in their predictions
for the width and height of each peak for small $l$ (large $d$). It is evident that the
numerical data exhibits finite size effects for short nodes, $l \lesssim 6$.



The locations of the peaks, $d_l$, coincide very well with the numerical data and we find
indeed that each peak corresponds to the out-degree of nodes of a given length $l$. The
locations of the peaks decrease exponentially with increasing $l$. The labels next to each
peak show the string lengths $l$ contributing predominantly to that peak.

Our reasoning above already shows that the oscillatory part of the out-degree distribution is
highly susceptible to finite-size effects. It turns out that these oscillations are less
pronounced or completely absent when single finite-size realizations of the network are
considered. In other words, these oscillations become apparent only when averaging over many
finite-size realizations, as we have done in our analysis.

We turn next to a discussion of the scaling behavior. 



2. Scaling Behavior 

Our analysis shows that the out-degree distribution is a superposition of Gaussian peaks with
mean $d_l$ and a variance that depends on the strength of finite size effects, as discussed in
the previous section. For large values of $d$ (small $l$) these peaks are well separated and
one can readily obtain the envelope of the peaks. From Eq. (70) we see that the height $E_l$
of a peak centered at $d_l$ is

$$E_l = \frac{N p q^l}{\sqrt{2\pi d_l}}. \tag{71}$$

Using Eq. (53), we obtain the scaling behavior

$$E(d) \sim d^{-\gamma_2}, \tag{72}$$

with

$$\gamma_2 = \frac{1}{2} \, \frac{\ln z - \ln q}{\ln z + \ln q}. \tag{73}$$



For the bit string model with exact matches, i.e., for $r = 2$ and in the
$\beta \to \infty$ limit, we find

$$\gamma_2 = \frac{1}{2} \, \frac{\ln 2 + \ln q}{\ln 2 - \ln q}. \tag{74}$$

For the numerical data shown, $q = 0.95$, yielding $\gamma_2 = 0.43$.
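
The quoted value can be reproduced directly from the general expression $\gamma_2 = \frac{1}{2} (\ln z - \ln q) / (\ln z + \ln q)$. The helper name `gamma2` below is ours; the sketch simply evaluates that formula.

```python
import math

def gamma2(z, q):
    """Envelope exponent of the oscillatory regime as a function of z and q."""
    return 0.5 * (math.log(z) - math.log(q)) / (math.log(z) + math.log(q))

# Bit strings with perfect matches: z = 1/2; numerical data: q = 0.95.
print(round(gamma2(0.5, 0.95), 2))
# A larger alphabet (smaller z) moves the exponent only slightly.
print(round(gamma2(0.25, 0.95), 2))
```

The second line illustrates the weak dependence of the exponent on the alphabet size mentioned in the abstract.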

For smaller values of $d$ (large $l$), the analysis presented
above ceases to be valid, since the peaks start to overlap.
In this regime, the contributions to the out-degree
distribution come predominantly from matches between
long strings, which are rare. As was remarked previously,
in this regime the distribution of $X_l$ will be Poisson, so
that we have

$$ P(d\,|\,l) = \frac{d_l^{\,d}}{d!}\, e^{-d_l}, \tag{75} $$



with $d_l$ as given before in Eq. (53). The out-degree distribution
for small $d$ is thus given by

$$ p(d) = \sum_{l \ge l^*} p\, q^l\, \frac{d_l^{\,d}}{d!}\, e^{-d_l}. \tag{76} $$



Since for small $l$ the $d_l$ values are quite large, the contributions
from the small-$l$ terms will be suppressed heavily
by the exponential factor, and therefore moving the cutoff
$l^*$ in the above sum down to 1 will not change the
result of the summation significantly. Noting that for
large $l$

$$ d_l \simeq \frac{N}{p}\,(qz)^l, \tag{77} $$



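we see that $d_l$ and $\Delta d_l = d_{l+1} - d_l$ approach zero in a
geometric fashion. Thus the summation over $l$ in Eq. (76)
can be converted to an integration over $x = d_l$ with
$\Delta x = d_l - d_{l+1}$, and we obtain

$$ p(d) = \frac{c}{d!} \int_0^{x^*} x^{\,d - \gamma_2 - \frac{1}{2}}\, e^{-x}\, dx, \tag{78} $$

where $x^* = d_{l^*}$ and $c$ is an overall numerical constant,

$$ c = \frac{p}{|\ln qz|} \left( \frac{p}{N} \right)^{\frac{1}{2} - \gamma_2}. \tag{79} $$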
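The Poisson mixture of Eq. (76) is straightforward to evaluate directly. The sketch below does so under two stated assumptions: the large-$l$ form $d_l \simeq (N/p)(qz)^l$ of Eq. (77) is used for every $l$, and the value $N = 50$ is a hypothetical choice made only to keep the sums short. With the cutoff moved down to $l = 1$, the total weight of the mixture is $\sum_{l \ge 1} p q^l = q$, which provides a simple sanity check.

```python
import math

def mean_out_degree(l: int, N: float, p: float, q: float, z: float) -> float:
    """Large-l form of d_l, Eq. (77); applied to all l here (an assumption)."""
    return (N / p) * (q * z) ** l

def poisson_pmf(d: int, mean: float) -> float:
    """Poisson probability of d events, evaluated via logarithms to avoid overflow."""
    return math.exp(d * math.log(mean) - mean - math.lgamma(d + 1))

def out_degree_pmf(d: int, N: float, p: float, q: float, z: float,
                   l_min: int = 1, l_max: int = 300) -> float:
    """Eq. (76): mixture of Poisson distributions weighted by p * q**l."""
    return sum(p * q ** l * poisson_pmf(d, mean_out_degree(l, N, p, q, z))
               for l in range(l_min, l_max + 1))

if __name__ == "__main__":
    N, p, z = 50, 0.05, 0.5          # N is a hypothetical value
    q = 1.0 - p
    for d in (1, 2, 4, 8):
        print(d, out_degree_pmf(d, N, p, q, z))
```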



The dominant contribution to the integrand comes from
$x \approx d < x^*$, and we therefore extend the upper limit to
infinity, obtaining

$$ p(d) \simeq c\, \frac{\Gamma\!\left(d - \gamma_2 + \frac{1}{2}\right)}{\Gamma(d + 1)}, \tag{80} $$



where $\Gamma(x)$ is the gamma function. The leading order
behavior of $\ln \Gamma(x)$ is given asymptotically, for large $x$,
by

$$ \ln \Gamma(x) = \left( x - \frac{1}{2} \right) \ln x - x + \frac{1}{2} \ln 2\pi + O\!\left( \frac{1}{x} \right). \tag{81} $$



Using the above expansion, we obtain after a little bit of
algebra

$$ \ln p(d) = \mathrm{const.} - \left( \gamma_2 + \frac{1}{2} \right) \ln d + O\!\left( \frac{1}{d} \right). \tag{82} $$



It can be readily checked that this approximation for
$\ln p(d)$ is good even for small values of $d$, and thus $p(d)$
exhibits scaling behavior, $p(d) \sim d^{-\gamma_1}$, with scaling
exponent

$$ \gamma_1 = \frac{1}{2} + \gamma_2. \tag{83} $$



For the numerical data with $z = 1/2$ and $q = 0.95$ we
find $\gamma_1 = 0.93$.
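The claim that the gamma-function ratio behaves as a power law can be checked numerically. The sketch below assumes that the $d$-dependence of Eq. (80) is $\Gamma(d + 1/2 - \gamma_2)/\Gamma(d + 1)$ and measures the slope of its logarithm against $\ln d$ between two degrees. The helper names are illustrative.

```python
import math

def log_p_shape(d: float, gamma2: float) -> float:
    """ln[Gamma(d + 1/2 - gamma2) / Gamma(d + 1)], the d-dependent part of Eq. (80)."""
    return math.lgamma(d + 0.5 - gamma2) - math.lgamma(d + 1.0)

def loglog_slope(d_lo: float, d_hi: float, gamma2: float) -> float:
    """Finite-difference slope of ln p(d) versus ln d between d_lo and d_hi."""
    rise = log_p_shape(d_hi, gamma2) - log_p_shape(d_lo, gamma2)
    return rise / (math.log(d_hi) - math.log(d_lo))

if __name__ == "__main__":
    g2 = 0.5 * (math.log(0.5) - math.log(0.95)) / (math.log(0.5) + math.log(0.95))
    print("expected slope:", -(g2 + 0.5))
    print("slope for d in [100, 1000]:", loglog_slope(100, 1000, g2))
    print("slope for d in [2, 20]:", loglog_slope(2, 20, g2))
```

The slope measured over small degrees stays close to $-\gamma_1$, in line with the remark that the approximation is good even for small $d$.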

As we have pointed out above, the cross-over between 
the two scaling regimes occurs when the depression (min- 
imum) between consecutive peaks disappears. This oc- 
curs roughly when 



1 



H+l 



V2^ 



> di- 



1 



i+i 



ndi 



(84)

yielding, via Eq. (55),

$$ d_l > \frac{2}{\pi}\, \frac{1}{\left( 1 - \sqrt{qz} \right)^2}. \tag{85} $$



For the values of the parameters employed in the numerical
simulations, this gives $d_l > 6.59$, $\ln d_l > 1.9$, which
is consistent with the data shown in figure (0). 
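The quoted cross-over value can be reproduced with a few lines of Python. The sketch assumes that the bound reads $d_l > (2/\pi)\,(1 - \sqrt{qz})^{-2}$, which is the form that returns the numbers quoted above for $q = 0.95$ and $z = 1/2$. The function name is illustrative.

```python
import math

def crossover_degree(q: float, z: float) -> float:
    """Cross-over bound, assumed form: (2/pi) / (1 - sqrt(q*z))**2."""
    return (2.0 / math.pi) / (1.0 - math.sqrt(q * z)) ** 2

if __name__ == "__main__":
    d_cross = crossover_degree(0.95, 0.5)
    print(d_cross, math.log(d_cross))
```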

We can also infer the large-$r$ behavior of $\gamma_1$ and $\gamma_2$ for
perfect matches. This corresponds to the case $z = 1/r$,
eq. (123. We find 



$$ \gamma_2 = \frac{1}{2}\,\frac{\ln r + \ln q}{\ln r - \ln q}, \tag{86} $$

and hence

$$ \lim_{r \to \infty} \gamma_2 = \frac{1}{2}, \tag{87} $$

and correspondingly $1/2 + \gamma_2 = \gamma_1 \to 1$ in this limit.
Thus, as the number of letters in the alphabet is increased,
the scaling exponents $\gamma_1$ and $\gamma_2$ approach the
values 1 and 1/2, respectively. Comparing with the values
for $r = 2$, we see that the dependence of $\gamma_1$ and $\gamma_2$ on
$r$, the number of letters in the alphabet, is rather weak.
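The weak dependence on the alphabet size can be made concrete by tabulating Eq. (86) for a few values of $r$. The sketch below uses $q = 0.95$ from the numerical data. The alphabet sizes in the loop are arbitrary illustrative choices.

```python
import math

def gamma2_perfect(r: int, q: float) -> float:
    """Eq. (86): envelope exponent for perfect matches, z = 1/r."""
    return 0.5 * (math.log(r) + math.log(q)) / (math.log(r) - math.log(q))

if __name__ == "__main__":
    q = 0.95
    for r in (2, 4, 20, 10**6):
        g2 = gamma2_perfect(r, q)
        print(f"r = {r:>7d}  gamma2 = {g2:.3f}  gamma1 = {g2 + 0.5:.3f}")
```

The exponent increases monotonically with $r$ and stays below its limiting value of 1/2.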



B. The In-Degree Distribution 

Consider a randomly selected string $G_l$ of length $l$.
Then the random variable $X_{kl}$ that was introduced before
counts the number of edges originating from a string
of length $k < l$ and terminating in $G_l$. Thus the in-degree
of $G_l$ is given by

$$ X_{\mathrm{in},l} = \sum_{k<l} X_{kl}. \tag{88} $$



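The statistics of $X_{kl}$, and hence of $X_{\mathrm{in},l}$, have already been
obtained before, and we find in the large-$L$ limit

$$ d_{\mathrm{in},l} = \sum_{k<l} n_k\, p(k,l), \tag{89} $$

$$ \sigma^2_{\mathrm{in},l} = \sum_{k<l} n_k\, p(k,l) \left( 1 - p(k,l) \right). \tag{90} $$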



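Noting also that the probability of selecting a string of
length $l$ is $p q^l$, the total in-degree distribution in the
large-$L$ limit is given by

$$ P_{\mathrm{in}}(d) = \sum_{l \ge 1} p\, q^l\, \frac{1}{\sqrt{2\pi \sigma^2_{\mathrm{in},l}}} \exp\left[ -\frac{(d - d_{\mathrm{in},l})^2}{2 \sigma^2_{\mathrm{in},l}} \right]. \tag{91} $$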



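When taking into account finite-size effects, the in-degree
distribution becomes (cf. Section IV.A.1)

$$ P_{\mathrm{in}}(d) = \sum_{l \ge 1} p\, q^l\, \frac{1}{\sqrt{2\pi \tilde{\sigma}^2_{\mathrm{in},l}}} \exp\left[ -\frac{(d - d_{\mathrm{in},l})^2}{2 \tilde{\sigma}^2_{\mathrm{in},l}} \right], \tag{92} $$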



where 



~^Li - E ^m'p(^. (1 - M'p(fc, 0) ■ (93) 



k<l 



Unfortunately, we have not been able to obtain closed- 
form expressions for d^^^i and (yin,ii in a manner analo- 
gous to the expressions for the out-degree, Eqs. itS^ and 
(|^ . In the case of the in-degree distributions, Eq. llSS|) 
requires a sum over the first argument of the matching
probability, $p(\cdot,\cdot;z)$, Eq. (36), rather than the second
FIG. 4: Comparison of the theoretical in-degree distributions
with numerical data (circles). The dotted line shows the
theoretical result for a network in the large-$L$ limit, where the
network is self-averaging and thus almost all possible realisations
of a string of a given length $l$ can be found. The solid
line is obtained after correcting for finite-size effects (see text
for details).



argument, as was the case for the out-degree distribu- 
tion. Due to the complicated dependence of the match- 
ing probability on its first argument this sum is, as far as 
we can tell, intractable. The necessary summations were 
therefore carried out numerically. 
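A numerical evaluation of this kind can be organised as in the following minimal Python sketch of Eqs. (89) to (91). The matching probability is passed in as a callable, since its closed form, Eq. (36), is not reproduced here. The function names, the dictionary-based inputs and the constant matching probability in the demonstration are all illustrative assumptions, not the procedure actually used for the figure.

```python
import math
from typing import Callable, Dict, Tuple

def in_degree_moments(l: int, n: Dict[int, float],
                      match_prob: Callable[[int, int], float]) -> Tuple[float, float]:
    """Mean and variance of the in-degree of a string of length l, Eqs. (89)-(90).

    n maps a string length k to the number n_k of nodes of that length.
    """
    mean = 0.0
    var = 0.0
    for k, n_k in n.items():
        if k < l:
            p_kl = match_prob(k, l)
            mean += n_k * p_kl
            var += n_k * p_kl * (1.0 - p_kl)
    return mean, var

def in_degree_density(d: float, n: Dict[int, float], length_prob: Dict[int, float],
                      match_prob: Callable[[int, int], float]) -> float:
    """Gaussian mixture of Eq. (91); length_prob maps l to its weight p * q**l."""
    total = 0.0
    for l, weight in length_prob.items():
        mean, var = in_degree_moments(l, n, match_prob)
        if var > 0.0:  # lengths with no incoming edges carry no Gaussian peak
            total += weight * math.exp(-(d - mean) ** 2 / (2.0 * var)) \
                     / math.sqrt(2.0 * math.pi * var)
    return total

if __name__ == "__main__":
    # Toy inputs: two string lengths and a constant matching probability (placeholder).
    n = {1: 10.0, 2: 5.0}
    toy_prob = lambda k, l: 0.5
    print(in_degree_moments(3, n, toy_prob))
    print(in_degree_density(7.5, n, {3: 1.0}, toy_prob))
```

Replacing `toy_prob` by an implementation of Eq. (36) and `n` by the expected node counts would give the large-$L$ curve.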

Figure (4) shows a comparison of the two theoretical
predictions, Eqs. (91) and (92), with the numerical data
of Balcan and Erzan [1].

The in-degree distribution, Eq. (91), and its finite-size
corrected form, Eq. (92), capture the qualitative features
seen in the simulations. Although there are deviations 
for small and large values of d, we will not pursue this 
any further in the present paper. 

Note, however, the stark difference between the shapes
of the in- and out-degree distributions, Figs. (2) and (4).
Apart from the distinct qualitative features, such as os- 
cillatory behavior for small d (rather than large d as in 
the out-degree distribution), the in-degree distribution is 
much narrower than the out-degree distribution. 



V. DISCUSSION 

We have obtained analytical expressions for the in-
and out-degree distributions of a contents-based network
model which was introduced and studied numerically by
Balcan and Erzan in [1]. We have shown that the behavior
of the out-degree distribution can be divided into two
regimes: a short and putative scaling regime for small
out-degrees that crosses over into an oscillatory regime
for large out-degrees. An analytical expression for the
cross-over point has been obtained as well. We have
found that the behavior of the out-degree distribution
for large out-degrees depends on the size of the network
realizations from which the distribution was sampled. We
have discussed these finite-size effects and have shown
analytically how they affect the behavior of the out-degree
distribution.

Our results were obtained for a generalized class of
contents-based network models in which a small number
of imperfect matches (finite, but low, temperature) were
allowed and strings were constructed from an alphabet
of $r$ letters. It turns out, however, that such generalizations
do not alter the main numerical findings of the
network model of Balcan and Erzan, which involved a
two-letter ($r = 2$) alphabet and perfect matches. The
scaling behavior which we have found, and even the numerical
values of the leading scaling exponents $\gamma_2$ and
$\gamma_1 = \gamma_2 + 0.5$, are robust under these generalizations. It
should be noted that, in

$$ \gamma_2 = \frac{1}{2}\,\frac{\ln z - \ln q}{\ln z + \ln q}, \tag{94} $$



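we have $z \to 1/r$ for $\beta \to \infty$, while $z \to 1$ in the "high
temperature" limit $\beta \to 0$, thus $r^{-1} < z < 1$. In the "low
temperature," or perfect matching, limit $\beta \to \infty$, using
$\ln q = \ln(1 - p) \simeq -p$,

$$ \gamma_2 \simeq \frac{1}{2} \left( 1 - \frac{2p}{\ln r} \right), \tag{95} $$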



where $p$ is a small number by assumption [1]. Even when
allowing for a small number of mismatches, $\gamma_2$ depends
very weakly on $r$ and $p$. On the other hand, for either
$r \to 1$, the trivial limit where no information is coded,
or the high temperature limit, where no matching conditions
are satisfied, the scaling relation is altered qualitatively,
with $\gamma_2 \to -1/2$.

We should remark on the robustness of the incipient
power law behavior found in the limit of small degrees, for
the out-degree distribution. Two different sources of randomness
together determine the degree distributions of
our model through the variables $d_l$ and $n_l$. While $d_l$ is
determined by the adjacency rule based on sequence matching,
and therefore depends on the length of the sequence
to be matched, the distribution of nodes of length $l$ could
have been chosen in many different ways. The exponential
dependence turns out to be algebraically tractable,
but it may be conjectured that any distribution which
has a tail that is decaying exponentially with $l$ would
give rise, all else remaining equal, to essentially the same
scaling behavior for the out-degree distribution in the
large-$l$ (small-$d$) regime, and therefore that $\gamma_1 \approx 1$ has a
high degree of universality.

We would like to end this paper by pointing out the 
possible relevance of this network model to understand- 
ing molecular networks !l j , in particular transcriptional 
genomic networks. 

Transcriptional genomic networks are obtained by 
identifying the nodes with genes, and the directed edges 
connecting two nodes with so-called transcription factors
(TF). A TF is the protein coded by the gene at the node
of origin, and binds (i.e., becomes chemically attached
to) a short DNA sequence within the promoter region,
typically upstream of the target gene, whose activity it
controls by either promoting or suppressing it [13, 14].

An assay of the recently available results coming from 
high-throughput experiments on the degree distribution 
of transcriptional genomic networks reveals that the out- 
degree distribution shows putative scaling over a very 
short range of about one decade at most, with a lot of 
scatter, and a marked departure from linearity on double 
logarithmic plots, for larger degrees. Nevertheless, with 
the assumption that $n(d) \sim d^{-\gamma}$ over the whole range,
the exponents $\gamma$ which have been reported are all smaller
than two, and closer to unity: 7 = 1.4 fveast) U^. 7=1 
(yeast) ELt ~ 1-1-1-8 (several genomes) [l^, 7 = 1-5 
{E. coli)Jl^. 7 = 1.3 (yeast) [l^. 

Comparing these findings for the degree distribution 
of the transcriptional regulatory network with the re- 
sults of our model is very suggestive. The marked but 
short range over which the data can indeed be fitted by 
a straight line in a log-log plot of the degree distribution 
has a power close to unity, as found in the experiments 
on transcriptional regulatory networks cited above. The 
crossover to a different regime towards the tail end of the 
distribution, is a feature that also shows similarity with 
the experimental results. Clearly the oscillations of the 
out-degree distribution. Figs. ^ and (Q are not seen in 
the degree distributions of the transcription regulatory 
networks extracted from any particular genome. In the 
language of our paper, real cellular networks are more 
like single finite-size realizations, rather than expected 
distributions calculated over ensembles of many different 
realizations of a random sequence. In our model, for any 
particular finite-size realization, only a relatively small 
number of data points would fall into this portion of the 
distribution and this would not be sufficient to resolve 
well the oscillations that make up the sample-averaged 
distribution. The small degree behavior of the degree 
distribution, however, is robust with respect to sample- 
to-sample fluctuations, as we have shown. 

We think that the similarity with reported degree 
statistics of transcriptional genomic networks is not for- 
tuitous. Sequence matching provides a highly plausible 
mechanism for the formation of the transcriptional reg- 
ulatory network. Such networks rely on the recognition 
of regulatory sequences by transcription factors. These 
points will be discussed in detail within a more compre- 
hensive comparison of features of content-based network 
models with real biological data in a forthcoming arti- 
cle Ei. 
APPENDIX

Here we outline the calculations leading to Eqs. (22)
and (26). In Section III, we defined the function
W'^^'>{a,b;x) as, Eq. ^, 



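$$ W_l^{(2)}(a,b;\mathbf{x}) = \frac{1}{r^k} \sum_{\mathbf{y}} f_a(\mathbf{x},\mathbf{y};\beta)\, f_b(\mathbf{x},\mathbf{y};\beta). \tag{96} $$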



As we pointed out in the text, when performing the
sum over $y$, two cases must be distinguished: (i) $|b-a| \ge l$
and (ii) $|b-a| < l$. In case (i), the sets of indices of $Y_{a,l}$
and $Y_{b,l}$ are distinct and the evaluation of the partition
sum proceeds in a manner analogous to Eq. (20), yielding

$$ W_l^{(2)}(a,b;\mathbf{x}) = \frac{1}{r^{2l}} \left[ 1 + (r-1) e^{-\beta} \right]^{2l}, \qquad |b-a| \ge l. \tag{97} $$

In case (ii) there is an overlap between the indices
of $Y_{a,l}$ and $Y_{b,l}$. Defining $|b-a| = m$, we find that
there are $l - m$ overlapping indices, and thus there
are $k - (l + m)$ distinct variables $y_c$ that are neither
in $Y_{a,l}$ nor in $Y_{b,l}$, so that a sum over the values of
these indices will give $r^{k-(l+m)}$. Next, it is convenient
to partition the remaining indices, $\{y_{a+1}, \ldots, y_{b+l}\}$,
into the three disjoint sets $S_1 = \{y_{a+1}, \ldots, y_{a+m}\}$,
$S_2 = \{y_{a+m+1} = y_{b+1}, \ldots, y_{a+l} = y_{b+l-m}\}$ and
$S_3 = \{y_{b+l-m+1}, \ldots, y_{b+l}\}$. Figure (5) shows an example
for $l = 7$, with $a = 2$ and $b = 5$, along with
the sets $S_1 = \{y_3, y_4, y_5\}$, $S_2 = \{y_6, y_7, y_8, y_9\}$ and
$S_3 = \{y_{10}, y_{11}, y_{12}\}$. With the definitions above, we find
for $|b-a| < l$,



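$$
\begin{aligned}
W_l^{(2)}(a,b;\mathbf{x}) = \frac{1}{r^{l+m}}
&\sum_{S_1} e^{-\beta \sum_{t=1}^{m} u(x_t,\, y_{a+t})} \\
\times &\sum_{S_3} e^{-\beta \sum_{t=1}^{m} u(x_{l-m+t},\, y_{b+l-m+t})} \\
\times &\sum_{S_2} e^{-\beta \sum_{t=1}^{l-m} \left[ u(x_t,\, y_{b+t}) + u(x_{m+t},\, y_{b+t}) \right]},
\end{aligned} \tag{98}
$$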



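and carrying out the sums over the $y$ variables, we obtain
($|b-a| < l$),

$$ W_l^{(2)}(a,b;\mathbf{x}) = \frac{1}{r^{l+m}} \left[ 1 + (r-1) e^{-\beta} \right]^{2m} \prod_{t=1}^{l-m} \left[ 1 + (r-1) e^{-2\beta} - u(x_t, x_{t+m}) \left( 1 - e^{-\beta} \right)^2 \right]. \tag{99} $$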
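The single-site sum behind each factor of the product in Eq. (99) can be verified by brute force. The sketch below assumes that the cost function is $u(x,y) = 0$ for equal letters and $1$ otherwise, which is the property the closed form relies on. It compares $\sum_y e^{-\beta[u(x_1,y) + u(x_2,y)]}$ with $1 + (r-1)e^{-2\beta} - u(x_1,x_2)(1 - e^{-\beta})^2$ for all letter pairs. The function names are illustrative.

```python
import math

def u(a: int, b: int) -> int:
    """Mismatch cost assumed here: 0 for equal letters, 1 otherwise."""
    return 0 if a == b else 1

def site_sum(x1: int, x2: int, r: int, beta: float) -> float:
    """Sum over one shared letter y, weighted by both alignments."""
    return sum(math.exp(-beta * (u(x1, y) + u(x2, y))) for y in range(r))

def closed_form(x1: int, x2: int, r: int, beta: float) -> float:
    """Bracketed factor of Eq. (99) for the letter pair (x1, x2)."""
    e = math.exp(-beta)
    return 1.0 + (r - 1) * e ** 2 - u(x1, x2) * (1.0 - e) ** 2

if __name__ == "__main__":
    r, beta = 3, 0.7
    for x1 in range(r):
        for x2 in range(r):
            print(x1, x2, site_sum(x1, x2, r, beta), closed_form(x1, x2, r, beta))
```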
FIG. 5: Schematic representation of the $y$ and $x$ averages for
the function $W^{(2)}(a,b;\beta)$ as defined in the text. The case
shown in the figure corresponds to $l = 7$ with $a = 2$ and
$b = 5$. The number of overlapping indices in the figure is
$l - m\,(= 4)$, with $m = b - a\,(= 3)$. Because of the overlapping,
when averaging over $x$, these indices fall into $m\,(= 3)$ disjoint
sets: $\{x_1, x_4, x_7\}$, $\{x_2, x_5\}$ and $\{x_3, x_6\}$.



Next, it is useful to introduce the $r \times r$ matrix $M(x,y)$
as

$$ M(x,y) = 1 + (r-1) e^{-2\beta} - u(x,y) \left( 1 - e^{-\beta} \right)^2, \tag{100} $$

with $x, y \in \{0, 1, 2, \ldots, r-1\}$. From the properties of $u$,
Eq. (|SJl, we find that 

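$$ M(x,y) = \begin{cases} A_1 \equiv 1 + (r-1) e^{-2\beta}, & x = y, \\ B_1 \equiv 2 e^{-\beta} + (r-2) e^{-2\beta}, & x \ne y, \end{cases} \tag{101} $$

and Eq. (99) can therefore be written as

$$ W_l^{(2)}(a,b;\mathbf{x}) = \frac{1}{r^{l+m}} \left[ 1 + (r-1) e^{-\beta} \right]^{2m} \prod_{t=1}^{l-m} M(x_t, x_{t+m}). \tag{102} $$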

Proceeding to perform the average over $x$,

$$ \frac{1}{r^l} \sum_{\mathbf{x}} W_l^{(2)}(a,b;\mathbf{x}), \tag{103} $$

observe that in Eq. (102) the variables $x$ can be partitioned
into $m$ disjoint sets $X$ with the additional property
that if $x_t \in X$, by implication $x_{t+m} \in X$. The
situation is shown schematically in Fig. (5) for $m = 3$,
where we have the 3 disjoint sets $\{x_1, x_4, x_7\}$, $\{x_2, x_5\}$
and $\{x_3, x_6\}$. Denoting these sets as $X_1, X_2, \ldots, X_m$,
and their respective numbers of elements as $n_1, n_2, \ldots, n_m$
($n_1 + n_2 + \cdots + n_m = l$), we see that the product in Eq. (102)
can be factorized as

$$ \prod_{t=1}^{l-m} M(x_t, x_{t+m}) = \prod_{x_t \in X_1} M(x_t, x_{t+m}) \cdots \prod_{x_t \in X_m} M(x_t, x_{t+m}). \tag{104} $$

Performing the summation over each of the factors, we
have for the first factor

$$ \sum_{X_1} \prod_{x_t \in X_1} M(x_t, x_{t+m}). \tag{105} $$



It can be easily shown that the sum over the variables
$x_t \in X_1$ reduces to an $(n_1 - 1)$-fold matrix product.
Denoting the matrix elements of the matrix $M^n$ by $(M^n)_{xy}$,
we therefore find

$$ \sum_{X_1} \prod_{x_t \in X_1} M(x_t, x_{t+m}) = \sum_{x,y} \left( M^{n_1 - 1} \right)_{xy}, \tag{106} $$

and hence

$$ \frac{1}{r^l} \sum_{\mathbf{x}} \prod_{t=1}^{l-m} M(x_t, x_{t+m}) = \frac{1}{r^l} \prod_{s=1}^{m} \sum_{x,y} \left( M^{n_s - 1} \right)_{xy}. \tag{107} $$



Owing to the structure of the matrix $M$, Eq. (101),
powers of $M$ retain the same structure, as can be readily
shown, and we therefore have

$$ \left( M^n \right)_{xy} = \begin{cases} A_n, & x = y, \\ B_n, & x \ne y. \end{cases} \tag{108} $$



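The quantities $A_n$ and $B_n$ can be evaluated recursively,
and one finds after a little algebra

$$ \begin{pmatrix} A_{n+1} \\ B_{n+1} \end{pmatrix} = \mathbf{T}^n \begin{pmatrix} A_1 \\ B_1 \end{pmatrix}, \tag{109} $$

where

$$ \mathbf{T}^n = \frac{1}{r} \begin{pmatrix} (r-1) \lambda_-^n + \lambda_+^n & -(r-1) \lambda_-^n + (r-1) \lambda_+^n \\ -\lambda_-^n + \lambda_+^n & \lambda_-^n + (r-1) \lambda_+^n \end{pmatrix}, \tag{110} $$

and

$$ \lambda_+ = \left[ 1 + (r-1) e^{-\beta} \right]^2, \tag{111} $$

$$ \lambda_- = \left[ 1 - e^{-\beta} \right]^2. \tag{112} $$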



We therefore find

$$ \sum_{x,y} \left( M^n \right)_{xy} = r A_n + r (r-1) B_n, \tag{113} $$

and thus

$$ \sum_{x,y} \left( M^n \right)_{xy} = r \left[ 1 + (r-1) e^{-\beta} \right]^{2n}. \tag{114} $$

Substituting Eq. (114) back into Eq. (107) we have

$$ \frac{1}{r^l} \sum_{\mathbf{x}} \prod_{t=1}^{l-m} M(x_t, x_{t+m}) = \frac{1}{r^l} \prod_{s=1}^{m} r \left[ 1 + (r-1) e^{-\beta} \right]^{2(n_s - 1)}, \tag{115} $$
and noting that $n_1 + n_2 + \cdots + n_m = l$, we finally obtain

$$ \frac{1}{r^l} \sum_{\mathbf{x}} \prod_{t=1}^{l-m} M(x_t, x_{t+m}) = \frac{r^m}{r^l} \left[ 1 + (r-1) e^{-\beta} \right]^{2(l-m)}, \tag{116} $$

which, when substituted into Eqs. (103) and (102), yields
the final resuh, Eq. if^ . 

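$$ \frac{1}{r^l} \sum_{\mathbf{x}} W_l^{(2)}(a,b;\mathbf{x}) = \frac{1}{r^{2l}} \left[ 1 + (r-1) e^{-\beta} \right]^{2l}. \tag{117} $$

Note that we obtain the same result as for the case
$|b-a| \ge l$, Eq. (97). In particular, we see that once
averaged over $x$, $W_l^{(2)}$ is independent of $a$ and $b$.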
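The independence of the averaged result from $a$ and $b$ can also be confirmed by exhaustive enumeration for small strings. The sketch below rests on assumptions about definitions given earlier in the paper: the weight is taken to be $f_a(\mathbf{x},\mathbf{y};\beta) = \exp[-\beta \sum_t u(x_t, y_{a+t})]$ with $u$ equal to 0 for matching letters and 1 otherwise, and offsets are 0-based. It averages $f_a f_b$ over all $\mathbf{x}$ and $\mathbf{y}$ and compares with $\{[1 + (r-1)e^{-\beta}]/r\}^{2l}$ for distinct offsets $a \ne b$, both overlapping and non-overlapping.

```python
import math
from itertools import product
from typing import Sequence

def u(a: int, b: int) -> int:
    """Mismatch cost assumed here: 0 for equal letters, 1 otherwise."""
    return 0 if a == b else 1

def weight(x: Sequence[int], y: Sequence[int], a: int, beta: float) -> float:
    """Assumed Boltzmann weight for aligning x with y at 0-based offset a."""
    return math.exp(-beta * sum(u(x[t], y[a + t]) for t in range(len(x))))

def averaged_W2(a: int, b: int, l: int, k: int, r: int, beta: float) -> float:
    """Average of f_a * f_b over all strings x (length l) and y (length k)."""
    total = 0.0
    for x in product(range(r), repeat=l):
        for y in product(range(r), repeat=k):
            total += weight(x, y, a, beta) * weight(x, y, b, beta)
    return total / (r ** l * r ** k)

def expected(l: int, r: int, beta: float) -> float:
    """Right-hand side of Eq. (117)."""
    return ((1.0 + (r - 1) * math.exp(-beta)) / r) ** (2 * l)

if __name__ == "__main__":
    l, k, r, beta = 3, 6, 2, 1.0
    for a, b in ((0, 1), (0, 2), (0, 3)):
        print(a, b, averaged_W2(a, b, l, k, r, beta), expected(l, r, beta))
```

The case $a = b$ is excluded, since the two alignments then coincide and the average is a different quantity.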